{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Motif Identification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Often times, a user may want to know which motifs that were learned by the autoencoder are associated or enriched with a certain label, such as an antigen specificity. We will go through how to identify motifs that are associated with a certain label in this tutorial. First let's load the Murine dataset which contains sorted cells from multiple antigen-specific populations and train the VAE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "import sys\n",
    "sys.path.append('../../')\n",
    "from DeepTCR.DeepTCR import DeepTCR_U\n",
    "\n",
    "# Instantiate training object\n",
    "DTCRU = DeepTCR_U('Tutorial')\n",
    "\n",
    "#Load Data from directories\n",
    "DTCRU.Get_Data(directory='../../Data/Murine_Antigens',Load_Prev_Data=False,aggregate_by_aa=True,\n",
    "               aa_column_beta=0,count_column=1,v_beta_column=2,j_beta_column=3)\n",
    "\n",
    "#Train VAE\n",
    "DTCRU.Train_VAE(Load_Prev_Data=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Following training, we will run the following command to discover enriched motifs for our labels.We will pass in the group parameter which specifies which label to look for the motifs. A key argument passed to this function is whether we want to do the enrichment at a per-sequence or per-sample level. If we leave the default parameter 'by_samples' to False, we will look for enriched motifs in all sequences irregardless of the samples they came from. By setting the 'by_samples' parameter to True, we will require do the enrichment analysis at the sample level requiring an enrichment to be present in multiple samples. This tends to be a more robust way in the setting one has many samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Motif Identification Completed\n"
     ]
    }
   ],
   "source": [
    "DTCRU.Motif_Identification(group='Kb-SIY',by_samples=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The motifs can then be found in fasta files in the results folder underneath (label)(alpha/beta)Motifs. These fasta fiels can then be used with \"https://weblogo.berkeley.edu/logo.cgi\" for motif visualization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Additionally, we can change the motif length size by changing the kernel parameter in the Train_VAE function. While changing the kernel value will generally not alter performance, it can be useful if one wants to look at different sized motifs. That being said, a 5-mer kernel can learn a 3-mer motif but a 3-mer kernel cannot learn a 5-mer motif."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
